Abstract. Support vector machines (SVMs) appeared in the early
nineties as optimal margin classifiers in the context of Vapnik's
statistical learning theory. Since then SVMs have been successfully
applied to real-world data analysis problems, often providing improved
results compared with other techniques. The SVMs operate within the 
framework of regularization theory by minimizing an empirical risk in 
a well-posed and consistent way. A clear advantage of the support vec- 
tor approach is that sparse solutions to classification and regression 
problems are usually obtained: only a few samples are involved in the 
determination of the classification or regression functions. This fact 
facilitates the application of SVMs to problems that involve a large 
amount of data, such as text processing and bioinformatics tasks. This 
paper is intended as an introduction to SVMs and their applications, 
emphasizing their key features. In addition, some algorithmic exten- 
sions and illustrative real-world applications of SVMs are shown. 

Key words and phrases: Support vector machines, kernel methods, 
regularization theory, classification, inverse problems. 



1. INTRODUCTION 

In the last decade, support vector machines (SVMs)
have increasingly turned into a standard methodology
in the computer science and engineering communities.
As Breiman [12] pointed out, these communities are
often involved in the solution of consulting and
industrial data analysis problems. The usual starting
point is a sample data set $\{(x_i, y_i) \in X \times Y\}_{i=1}^n$,
and the goal is to "learn" the relationship between the
x and y variables. The space X may be, for instance,
the space of 20 x 20 binary matrices that represent
alphabetic uppercase characters, and Y would be the
label set {1, ..., 27}. Similarly, X may be
$\mathbb{R}^{10,000}$, the space corresponding to
a document data base with a vocabulary of 10,000 
different words. In this case Y would be the set made 
up of a finite number of predefined semantic doc- 
ument classes, such as statistics, computer science, 
sociology and so forth. The main goal in this context 
usually is predictive accuracy, and in most cases it 
is not possible to assume a parametric form for the 
probability distribution p(x, y). Within this setting 
many practitioners concerned with providing prac- 
tical solutions to industrial data analysis problems 
put more emphasis on algorithmic modeling than on 
data models. However, a solely algorithmic point of 
view can lead to procedures with a black box be- 
havior, or even worse, with a poor response to the 
bias-variance dilemma. Neural networks constitute
a paradigmatic example of this approach. The (semiparametric)
model implemented by neural networks
is powerful enough to approximate continuous func- 
tions with arbitrary precision. On the other hand, 
neural network parameters are very hard to tune 
and interpret, and statistical inference is usually not 
possible [51]. 

The SVMs provide a compromise between the para- 
metric and the pure nonparametric approaches: As 
in linear classifiers, SVMs estimate a linear deci- 
sion function, with the particularity that a previ- 
ous mapping of the data into a higher-dimensional 
feature space may be needed. This mapping is char- 
acterized by the choice of a class of functions known 
as kernels. The support vector method was intro- 
duced by Boser, Guyon and Vapnik [10] at the Com- 
putational Learning Theory (COLT92) ACM Con- 
ference. Their proposal subsumed into an elegant 
and theoretically well founded algorithm two semi- 
nal ideas, which had already individually appeared 
throughout previous years: the use of kernels and 
their geometrical interpretation, as introduced by 
Aizerman, Braverman and Rozonoer [1], and the 
idea of constructing an optimal separating hyper- 
plane in a nonparametric context, developed by Vap- 
nik and Chervonenkis [78] and by Cover [16]. The 
name "support vector" was explicitly used for the 
first time by Cortes and Vapnik [15]. In recent years, 
several books and tutorials on SVMs have appeared. 
A reference with many historical annotations is the 
book by Cristianini and Shawe-Taylor [20]. For a
review of SVMs from a purely geometrical point
of view, the paper by Bennett and Campbell [9] is
recommended. An exposition of kernel methods with a
Bayesian flavor can be found in the book by Herbrich
[30]. Concerning the statistical literature, the book
by Hastie, Tibshirani and Friedman [28] includes a 
chapter dedicated to SVMs. 

We illustrate the basic ideas of SVMs for the two- 
group classification problem. This is the typical ver- 
sion and the one that best summarizes the ideas that 
underlie SVMs. The issue of discriminating more 
than two groups can be consulted, for instance, in [37]. 

Consider a classification problem where the discriminant
function is nonlinear, as illustrated in Figure 1(a).
Suppose we have a mapping $\Phi$ into a "feature space"
such that the data under consideration
have become linearly separable as illustrated in Fig- 
ure 1(b). From the infinite number of existing sepa- 
rating hyperplanes, the support vector machine looks 
for the plane that lies furthest from both classes,
known as the optimal (maximal) margin hyperplane.
To be more specific, denote the available mapped
sample by $\{(\Phi(x_i), y_i)\}_{i=1}^n$, where $y_i \in \{-1, +1\}$
indicates the two possible classes. Denote by
$w^T\Phi(x) + b = 0$ any separating hyperplane in the space of the
mapped data equidistant to the nearest point in
each class. Under the assumption of separability, we
can rescale w and b so that $|w^T\Phi(x) + b| = 1$ for
those points in each class nearest to the hyperplane.
Therefore, it holds that for every $i \in \{1, \dots, n\}$,

(1.1)  $$w^T\Phi(x_i) + b \ \begin{cases} \ge 1, & \text{if } y_i = +1, \\ \le -1, & \text{if } y_i = -1. \end{cases}$$

After the rescaling, the distance from the nearest
point in each class to the hyperplane is $1/\|w\|$. Hence,
the distance between the two groups is $2/\|w\|$, which
is called the margin. To maximize the margin, the
following optimization problem has to be solved:

(1.2)  $$\min_{w,b} \ \|w\|^2$$
       subject to (s.t.) $y_i(w^T\Phi(x_i) + b) \ge 1$, $i = 1, \dots, n$,

where the square in the norm of w has been introduced
to make the problem quadratic. Notice that,
given its convexity, this optimization problem has
no local minima other than the global one. Consider
the solution of problem (1.2), and denote it by $w^*$ and
$b^*$. This solution determines the hyperplane in the
feature space $D^*(x) = (w^*)^T\Phi(x) + b^* = 0$. Points
$\Phi(x_i)$ that satisfy the equalities
$y_i((w^*)^T\Phi(x_i) + b^*) = 1$ are called support
vectors [in Figure 1(b) the support vectors are the 
black points]. As we will make clear later, the sup- 
port vectors can be automatically determined from 
the solution of the optimization problem. Usually 
the support vectors represent a small fraction of the 
sample, and the solution is said to be sparse. The 
hyperplane $D^*(x) = 0$ is completely determined by
the subsample made up of the support vectors. This 
fact implies that, for many applications, the evalua- 
tion of the decision function D*(x) is computation- 
ally efficient, allowing the use of SVMs on large data 
sets in real-time environments. 

The SVMs are especially useful within ill-posed 
contexts. A discussion of ill-posed problems from 
a statistical point of view may be seen in [55]. A 
common ill-posed situation arises when dealing with 
data sets with a low ratio of sample size to dimension.
This kind of difficulty often comes up in problems
such as automatic classification of web pages
or microarrays. Consider, for instance, the following
classification problem, where the data set is a
text data base that contains 690 documents. These 
documents have been retrieved from the LISA (Li- 
brary Science Abstracts) and the INSPEC (biblio- 
graphic references for physics, computing and engi- 
neering research, from the IEE Institute) data bases, 
using, respectively, the search keywords "library sci- 
ence" (296 records) and "pattern recognition" (394 
records). We have selected as data points the terms 
that occur in at least ten documents, obtaining 982 
terms. Hence, the data set is given by a 982 x 690
matrix, say T, where $T_{ij} = 1$ if term i occurs in
document j and $T_{ij} = 0$ otherwise. For each term,
we check the number of library science and pattern 
recognition documents that contain it. The highest 
value determines the class of the term. This proce- 
dure is standard in the field of automatic thesaurus 
generation (see [5]). The task is to check the perfor- 
mance of the SVM classifier in recovering the class 
labels obtained by the previous procedure. Notice 
that we are dealing with about 1000 points in nearly 
700 dimensions. We have divided the data set into a 
training set (80% of the data points) and a test set 
(20% of the data points). Since the sample is rela- 
tively small with respect to the space dimension, it 
should be easy for any method to find a criterion 
that separates the training set into two classes, but 
this does not necessarily imply the ability to cor- 
rectly classify the test data. 

The results obtained using Fisher linear discriminant
analysis (FLDA), the k-nearest neighbor classifier
(k-NN) with k = 1 and the linear SVM [i.e.,
taking $\Phi$ as the identity map $\Phi(x) = x$] are shown
in Table 1.

Fig. 1. (a) Original data in the input space. (b) Mapped data in the feature space.

Table 1
Classification percentage errors for a two-class text data base

Method          Training error    Test error
FLDA            0.0%              31.4%
k-NN (k = 1)    0.0%              14.0%
Linear SVM      0.0%              3.0%

It is apparent that the three methods have been 
able to find a criterion that perfectly separates the 
training data set into two classes, but only the linear
SVM shows good performance when classifying
new data points. The best result for the k-NN
method (shown in the table) is obtained for k = 1, 
an unsurprising result, due to the "curse of dimen- 
sionality" phenomenon, given the high dimension of 
the data space. Regarding FLDA, the estimation 
of the mean vectors and covariance matrices of the 
groups is problematic given the high dimension and 
the small number of data points. The SVMs also 
calculate a linear hyperplane, but are looking for 
something different — margin maximization, which 
will only depend on the support vectors. In addi- 
tion, there is no loss of information caused by pro- 
jections of the data points. The successful behavior 
of the support vector method is not accidental, since,
as we will see below, SVMs are supported by regu- 
larization theory, which is particularly useful for the 
solution of ill-posed problems like the present one. 

In summary, we have just described the basics of 
a classification algorithm which has the following 
features: 

• Reduction of the classification problem to the computation
of a linear decision function.

• Absence of local minima in the SVM optimization
problem.

• A computationally efficient decision function (sparse
solution).

In addition, in the next sections we will also discuss 
other important features such as the use of kernels 
as a primary source of information or the tuning of 
a very reduced set of parameters. 

The rest of the paper is organized as follows. Sec- 
tion 2 shows the role of kernels within the SVM ap- 
proach. In Section 3 SVMs are developed from the 
regularization theory perspective and some illustra- 
tive examples are given. Section 4 reviews a number 
of successful SVM applications to real-world prob- 
lems. In Section 5 algorithmic extensions of SVMs 
are presented. Finally, in Section 6 some open ques- 
tions and final remarks are presented. 

2. THE KERNEL MAPPING 

In this section we face one of the key issues of 
SVMs: how to use $\Phi(x)$ to map the data into a
higher-dimensional space. This procedure is justified
by Cover's theorem [16], which guarantees that
any data set becomes arbitrarily separable as the
data dimension grows. Of course, finding such nonlinear
transformations is far from trivial. To achieve
this task, a class of functions called kernels is used.
Roughly speaking, a kernel K(x, y) is a real-valued
function $K : X \times X \to \mathbb{R}$ for which there exists a
function $\Phi : X \to Z$, where Z is a real vector space,
with the property $K(x, y) = \Phi(x)^T\Phi(y)$. This function
$\Phi$ is precisely the mapping in Figure 1. The
kernel K(x, y) acts as a dot product in the space Z.
In the SVM literature X and Z are called, respec- 
tively, input space and feature space (see Figure 1). 

As an example of such a K, consider two data
points $x_1$ and $x_2$, with $x_i = (x_{i1}, x_{i2})^T \in \mathbb{R}^2$, and
$K(x_1, x_2) = (1 + x_1^T x_2)^2 = (1 + x_{11}x_{21} + x_{12}x_{22})^2 = \Phi(x_1)^T\Phi(x_2)$,
where $\Phi(x_i) = (1, \sqrt{2}x_{i1}, \sqrt{2}x_{i2}, x_{i1}^2, x_{i2}^2, \sqrt{2}x_{i1}x_{i2})$.
Thus, in this example $\Phi : \mathbb{R}^2 \to \mathbb{R}^6$.
As we will show later, explicit knowledge of both the
mapping $\Phi$ and the vector w will not be needed: we
need only K in its closed form.
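
The identity in this example can be checked numerically. The following is a minimal sketch in plain Python; the function names `poly_kernel`, `feature_map` and `dot`, and the two test points, are illustrative choices and do not come from the paper.

```python
import math

def poly_kernel(x, z):
    """Closed-form kernel K(x, z) = (1 + x^T z)^2 for 2-D inputs."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit map Phi: R^2 -> R^6 associated with the kernel above."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

# Two arbitrary (hypothetical) points in the input space R^2.
x1, x2 = (0.5, -1.0), (2.0, 3.0)

kernel_value = poly_kernel(x1, x2)                       # uses only the closed form
feature_value = dot(feature_map(x1), feature_map(x2))    # uses the explicit map
```

Both quantities agree up to floating-point rounding, which is the sense in which the closed form of K replaces the explicit evaluation of the six-dimensional map.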

To be more specific, a kernel K is a positive definite
function that admits an expansion of the form
$K(x, y) = \sum_{i=1}^{\infty} \lambda_i \Phi_i(x)\Phi_i(y)$, where $\lambda_i \in \mathbb{R}^+$.
Sufficient conditions for the existence of such an expansion
are given in Mercer's theorem [43]. The function
K(x, y), known as Mercer's kernel, implicitly defines
the mapping $\Phi$ by letting
$\Phi(x) = (\sqrt{\lambda_1}\Phi_1(x), \sqrt{\lambda_2}\Phi_2(x), \dots)^T$.



Examples of Mercer's kernels are the linear kernel
$K(x, y) = x^T y$, polynomial kernels $K(x, y) = (c + x^T y)^d$
and the Gaussian kernel $K_c(x, y) = e^{-\|x - y\|^2/c}$.
In the first case, the mapping is the identity. Polynomial
kernels map the data into finite-dimensional
vector spaces. With the Gaussian kernel, the data
are mapped onto an infinite dimensional space
$Z = \mathbb{R}^{\infty}$ (all the $\lambda_i \ne 0$ in the kernel expansion; see [63]
for the details).
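
As an illustration, the three kernels can be written as short Python functions. This is a sketch under stated assumptions: the function names, the default parameter values and the random sample are hypothetical, and the final check only probes positive semidefiniteness of one Gram matrix along one random direction; it is not a proof of Mercer's condition.

```python
import math
import random

def linear_kernel(x, y):
    """K(x, y) = x^T y."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, c=1.0, d=2):
    """K(x, y) = (c + x^T y)^d."""
    return (c + linear_kernel(x, y)) ** d

def gaussian_kernel(x, y, c=1.0):
    """K_c(x, y) = exp(-||x - y||^2 / c)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / c)

def gram_matrix(kernel, points):
    """Matrix of kernel evaluations K(x_i, x_j) over a sample."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(K, z):
    """z^T K z, which is nonnegative for a positive semidefinite K."""
    n = len(z)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))

random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(8)]
K = gram_matrix(gaussian_kernel, points)
z = [random.uniform(-1, 1) for _ in points]
q = quadratic_form(K, z)
```

The Gram matrix is symmetric with ones on the diagonal for the Gaussian kernel, and the quadratic form `q` is nonnegative up to rounding.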

Given a kernel K, we can consider the set of functions
spanned by finite linear combinations of the
form $f(x) = \sum_j \alpha_j K(x_j, x)$, where the $x_j \in X$. The
completion of this vector space is a Hilbert space
known as a reproducing kernel Hilbert space
(RKHS) [3]. Since $K(x_j, x) = \Phi(x_j)^T\Phi(x)$, the functions
f(x) that belong to a RKHS can be expressed
as $f(x) = w^T\Phi(x)$, with $w = \sum_j \alpha_j\Phi(x_j)$, that is,
f(x) = 0 describes a hyperplane in the feature space
determined by $\Phi$ [as the one illustrated in Figure
1(b)]. Thus, reproducing kernel Hilbert spaces provide
a natural context for the study of hyperplanes
in feature spaces through the use of kernels like those
introduced in Section 1. Without loss of generality, a
constant b can be added to f (see [64] for a complete
discussion), taking the form

(2.1)  $$f(x) = \sum_j \alpha_j K(x_j, x) + b.$$

Equation (2.1) answers the question of how to use
$\Phi(x)$ to map the data onto a higher-dimensional
space: Since f(x) can be evaluated using expression
(2.1) [in which only the kernel values $K(x_j, x)$
are involved], $\Phi$ acts implicitly through the closed
form of K. In this way, the kernel function K is employed
to avoid an explicit evaluation of $\Phi$ (often a
high-dimensional mapping). This is the reason why
knowledge of the explicit mapping $\Phi$ is not needed.

As we will show in the next section, SVMs work 
by minimizing a regularization functional that in- 
volves an empirical risk plus some type of penaliza- 
tion term. The solution to this problem is a function 
that has the form (2.1). This optimization process 
necessarily takes place within the RKHS associated 
with the kernel K. The key point in this computation
is the way in which SVMs select the weights
$\alpha_j$ in (2.1) (the points $x_j$ are trivially chosen as the
sample data points $x_i$). A nice fact is that the estimation
of these weights, which determine the decision
function in the RKHS, is reduced to the solution
of a smooth and convex optimization problem.

3. SUPPORT VECTOR MACHINES:
A REGULARIZATION METHOD 

In Section 1 we introduced the formulation of SVMs 
for the situation illustrated in Figure 1(b), where 
the mapped data have become linearly separable. 
We consider now the more general case where the 
mapped data remain nonseparable. This situation 
is illustrated in Figure 2(a). The SVMs address this 
problem by finding a function f that minimizes an
empirical error of the form $\sum_{i=1}^n L(y_i, f(x_i))$, where
L is a particular loss function and $\{(x_i, y_i)\}_{i=1}^n$ is the
available data sample. There may be an infinite number
of solutions, in which case the problem is ill-posed.
Our aim is to show how SVMs make the
problem well-posed. As a consequence, the decision 
function calculated by the SVM will be unique, and 
the solution will depend continuously on the data. 

The specific loss function L used within the SVM
approach is $L(y_i, f(x_i)) = (1 - y_i f(x_i))_+$, with
$(x)_+ = \max(x, 0)$. This loss function is called hinge
loss and is represented in Figure 3. It is zero for
well classified points with $|f(x_i)| \ge 1$ and is linear
otherwise. Hence, the hinge loss function does not
penalize large values of $f(x_i)$ with the same sign as
$y_i$ (understanding large to mean $|f(x_i)| \ge 1$).

This behavior agrees with the fact that in classification
problems only an estimate of the classification
boundary is needed. As a consequence, we only
take into account points such that $L(y_i, f(x_i)) > 0$
to determine the decision function.
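
The behavior of the hinge loss described above can be seen on a few hand-picked values. The sketch below is illustrative only: the names `hinge_loss` and `empirical_risk` and the four label/score pairs are assumptions made for the example.

```python
def hinge_loss(y, fx):
    """Hinge loss L(y, f(x)) = max(1 - y * f(x), 0) for a label y in {-1, +1}."""
    return max(1.0 - y * fx, 0.0)

def empirical_risk(labels, scores):
    """Average hinge loss over a sample."""
    return sum(hinge_loss(y, s) for y, s in zip(labels, scores)) / len(labels)

# Hypothetical labels and decision values f(x_i).
labels = [+1, +1, -1, -1]
scores = [2.0, 0.5, -1.0, 0.3]

losses = [hinge_loss(y, s) for y, s in zip(labels, scores)]
risk = empirical_risk(labels, scores)
```

The first and third points are well classified with $|f(x_i)| \ge 1$ and contribute nothing; the second is on the correct side but inside the margin band, and the fourth is misclassified, so only these two have positive loss.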

To reach well-posedness, SVMs make use of regu- 
larization theory, for which several similar approaches 
have been proposed [33, 60, 73]. The most widely used
setting minimizes Tikhonov's regularization functional
[73], which consists of solving the optimization
problem

(3.1)  $$\min_{f \in H_K} \ \frac{1}{n}\sum_{i=1}^n (1 - y_i f(x_i))_+ + \mu\|f\|_K^2,$$

Fig. 3. Hinge loss function $L(y_i, f(x_i)) = (1 - y_i f(x_i))_+$: (a) $L(-1, f(x_i))$; (b) $L(+1, f(x_i))$.






J. M. MOGUERZA AND A. MUNOZ 



where $\mu > 0$, $H_K$ is the RKHS associated with the
kernel K, $\|f\|_K$ denotes the norm of f in the RKHS
and $x_i$ are the sample data points. Given that f belongs
to $H_K$, it takes the form $f(\cdot) = \sum_j \alpha_j K(x_j, \cdot)$.
As in Section 2, f(x) = 0 is a hyperplane in the feature
space. Using the reproducing property
$\langle K(x_j, \cdot), K(x_l, \cdot)\rangle_K = K(x_j, x_l)$ (see [3]), it holds that
$\|f\|_K^2 = \langle f, f\rangle_K = \sum_j\sum_l \alpha_j\alpha_l K(x_j, x_l)$.

In (3.1) the scalar $\mu$ controls the trade-off between
the fit of the solution f to the data (measured
by L) and the approximation capacity of the function
space that f belongs to (measured by $\|f\|_K$).
It can be shown [11, 48] that the space where the
solution is sought takes the form
$\{f \in H_K : \|f\|_K \le (\sup_{y \in Y} L(y, 0))/\mu\}$, a compact ball in the RKHS.
Note that the larger $\mu$ is, the smaller is the ball
and the more restricted is the search space. This
is the way in which regularization theory imposes
compactness in the RKHS. Cucker and Smale [21]
showed that imposing compactness on the space ensures
well-posedness of the problem and, thus, uniqueness
of the solution (refer to the Appendix for details).

The solution to problem (3.1) has the form
$f(x) = \sum_{i=1}^n \alpha_i K(x_i, x) + b$, where $x_i$ are the sample data
points, a particular case of (2.1). This result is known
as the representer theorem. For details, proofs and
generalizations, refer to [36, 67] or [18]. It is immediate
to show that $\|f\|_K^2 = \|w\|^2$, where
$w = \sum_i \alpha_i\Phi(x_i)$. Given this last result, problem (3.1) can
be restated as

(3.2)  $$\min_{w,b} \ \frac{1}{n}\sum_{i=1}^n (1 - y_i(w^T\Phi(x_i) + b))_+ + \mu\|w\|^2.$$

It is worth mentioning that the second term in (3.2) 
coincides with the term in the objective function of 
(1.2). Problems (3.1) and (3.2) review some of the 
key issues of SVMs enumerated at the end of Section 
1: Through the use of kernels, the a priori problem of 
estimating a nonlinear decision function in the input 
space is transformed into the a posteriori problem of 
estimating the weights of a hyperplane in the feature 
space. 
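
The equality between the RKHS norm of f and the Euclidean norm of w, which links problems (3.1) and (3.2), can be verified numerically whenever the feature map is known explicitly. The sketch below reuses the degree-2 polynomial kernel of Section 2; the sample points and the weights `alpha` are arbitrary illustrative values, not quantities from the paper.

```python
import math

def kernel(x, z):
    """Polynomial kernel K(x, z) = (1 + x^T z)^2 on R^2."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit feature map Phi: R^2 -> R^6 for the kernel above."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

# Hypothetical sample points x_i and expansion weights alpha_i.
points = [(0.0, 1.0), (1.0, -1.0), (2.0, 0.5)]
alpha = [0.3, -0.7, 0.2]
n = len(points)

# ||f||_K^2 computed from kernel evaluations only.
rkhs_norm_sq = sum(alpha[j] * alpha[l] * kernel(points[j], points[l])
                   for j in range(n) for l in range(n))

# w = sum_i alpha_i Phi(x_i), computed in the feature space.
phis = [feature_map(p) for p in points]
w = [sum(a * phi[k] for a, phi in zip(alpha, phis)) for k in range(6)]
w_norm_sq = sum(c * c for c in w)

# f(x) evaluated both ways at a new point (b = 0 here).
x_new = (0.4, -0.2)
f_kernel = sum(a * kernel(p, x_new) for a, p in zip(alpha, points))
f_feature = sum(wk * pk for wk, pk in zip(w, feature_map(x_new)))
```

For kernels such as the Gaussian one, w cannot be formed explicitly, and only the kernel-based expressions remain computable.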

Because of the hinge loss function, problem (3.2) 
is nondifferentiable. This lack of differentiability im- 
plies a difficulty for efficient optimization techniques; 
see [7] or [47]. Problem (3.2) can be made smooth
by straightforwardly formulating it as (see [41])

(3.3)  $$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i$$
       s.t. $y_i(w^T\Phi(x_i) + b) \ge 1 - \xi_i$, $i = 1, \dots, n$,
            $\xi_i \ge 0$, $i = 1, \dots, n$,

where $\xi_i$ are slack variables introduced to avoid the
nondifferentiability of the hinge loss function and
$C = 1/(2\mu n)$. This is the most widely used SVM formulation.
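
The value of C can be recovered in two short steps. The following derivation is a sketch that uses only the symbols of (3.2) and (3.3); it assumes $\mu > 0$, so that rescaling the objective does not change its minimizers.

```latex
\begin{align*}
% Step 1: multiply the objective of (3.2) by the positive constant 1/(2\mu).
\frac{1}{2\mu}\left[\frac{1}{n}\sum_{i=1}^n \bigl(1 - y_i(w^T\Phi(x_i) + b)\bigr)_+ + \mu\|w\|^2\right]
  &= \frac{1}{2}\|w\|^2 + \frac{1}{2\mu n}\sum_{i=1}^n \bigl(1 - y_i(w^T\Phi(x_i) + b)\bigr)_+ . \\
% Step 2: replace each hinge term by a slack variable bounded below by it.
\bigl(1 - y_i(w^T\Phi(x_i) + b)\bigr)_+
  &= \min\bigl\{\xi_i : \ \xi_i \ge 1 - y_i(w^T\Phi(x_i) + b), \ \xi_i \ge 0\bigr\}.
\end{align*}
```

Substituting the second identity into the first gives the objective $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i$ with $C = 1/(2\mu n)$, together with the two families of constraints in (3.3).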

The slack variables $\xi_i$ allow violations of constraints
(1.1), extending problem (1.2) to the nonseparable
case [problem (1.2) would not be solvable for nonseparable
data]. The slack variables guarantee the
existence of a solution. The situation is shown in
Figure 2(b), which constitutes a generalization of
Figure 1(b). Notice that problem (1.2) is a particular
case of problem (3.3). To be more specific, if
the mapped data become separable, problem (1.2)
is equivalent to problem (3.3) when, at the solution,
$\xi_i = 0$. Intuitively, we want to solve problem
(1.2) and, at the same time, minimize the number
of nonseparable samples, that is, $\sum_i \#(\xi_i > 0)$. Since
the inclusion of this term would provide a nondifferentiable
combinatorial problem, the smooth term
$\sum_{i=1}^n \xi_i$ appears instead.

We have deduced the standard SVM formulation
(3.3) via the use of regularization theory. This framework
guarantees that the empirical error for SVMs
converges to the expected error as $n \to \infty$ [21], that
is, the decision functions obtained by SVMs are
statistically consistent. Therefore, the separating
hyperplanes obtained by SVMs are neither arbitrary
nor unstable. This remark is pertinent since Cover's
theorem (which guarantees that any data set becomes
arbitrarily separable as the data dimension
grows) could induce some people to think that SVM
classifiers are arbitrary.

By standard optimization theory, it can be shown
that problem (3.3) is equivalent to solving

(3.4)  $$\min_{\lambda} \ \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_j y_i y_j K(x_i, x_j) - \sum_{i=1}^n \lambda_i$$
       s.t. $\sum_{i=1}^n y_i\lambda_i = 0$,
            $0 \le \lambda_i \le C$, $i = 1, \dots, n$.

The $\lambda_i$ variables are the Lagrange multipliers associated
with the constraints in (3.3). This problem
is known in optimization theory as the dual problem
of (3.3) [7]. It is convex and quadratic and,
therefore, every local minimum is a global minimum.
In practice, this is the problem to solve, and efficient
methods specific for SVMs have been developed
(see [34, 58, 61]).
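
A two-point toy problem makes the dual (3.4) concrete. In the sketch below, the data, the value of C and the grid scan are assumptions chosen for illustration: with one point per class, the equality constraint forces both multipliers to be equal, so a one-dimensional scan is enough. This is not one of the SVM-specific solvers cited above.

```python
def dual_objective(lam, y, K):
    """Objective of (3.4): 0.5 * sum_ij lam_i lam_j y_i y_j K_ij - sum_i lam_i."""
    n = len(lam)
    quad = sum(lam[i] * lam[j] * y[i] * y[j] * K[i][j]
               for i in range(n) for j in range(n))
    return 0.5 * quad - sum(lam)

# Hypothetical data: one point per class, linear kernel.
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [+1, -1]
K = [[sum(a * b for a, b in zip(p, q)) for q in X] for p in X]
C = 10.0

# The constraint sum_i y_i lam_i = 0 reads lam_1 = lam_2 = t, with 0 <= t <= C.
grid = [k / 1000.0 for k in range(int(C * 1000) + 1)]
best_t = min(grid, key=lambda t: dual_objective([t, t], y, K))
best_value = dual_objective([best_t, best_t], y, K)

# Recover w* = sum_i lam_i y_i x_i for the linear kernel.
w = [sum(best_t * yi * xi[k] for yi, xi in zip(y, X)) for k in range(2)]
```

Along the feasible line the objective is $2t^2 - 2t$, so the scan returns t = 0.5, which gives w* = (1, 0) and a margin of 2/||w*|| = 2, the distance between the two points.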

Let the vector $\lambda^*$ denote the solution to problem
(3.4). Points that satisfy $\lambda_i^* > 0$ are the support
vectors (shown in black in Figure 2(b) for the nonseparable
case). It can be shown that the solution
to problem (3.3) is $w^* = \sum_{i=1}^n \lambda_i^* y_i \Phi(x_i)$ and

(3.5)  $$b^* = -\frac{\sum_{i=1}^n \lambda_i^* y_i K(x_i, x^+) + \sum_{i=1}^n \lambda_i^* y_i K(x_i, x^-)}{2},$$

where $x^+$ and $x^-$ are, respectively, two support vectors
in classes +1 and -1 such that their associated
Lagrange multipliers $\lambda^+$ and $\lambda^-$ satisfy
$0 < \lambda^+ < C$ and $0 < \lambda^- < C$.

The desired decision function, which determines
the hyperplane $(w^*)^T\Phi(x) + b^* = 0$, takes the form

(3.6)  $$D^*(x) = (w^*)^T\Phi(x) + b^* = \sum_{i=1}^n \lambda_i^* y_i K(x_i, x) + b^*.$$

Equations (3.5) and (3.6) show that $D^*(x)$ is completely
determined by the subsample made up by
the support vectors, the only points in the sample
for which $\lambda_i^* \ne 0$. This definition of support vector
is coherent with the geometrical one given in
Section 1. The reason is that the Lagrange multipliers
$\lambda_i^*$ must fulfill the strict complementarity conditions
(see [7]), that is, $\lambda_i^*(y_i D^*(x_i) - 1 + \xi_i) = 0$, where either
$\lambda_i^* = 0$ or $y_i D^*(x_i) = 1 - \xi_i$. Therefore, if $\lambda_i^* \ne 0$,
then $y_i D^*(x_i) = 1 - \xi_i$ and $x_i$ is one of the points
that defines the decision hyperplane [one of the black
points in Figure 2(b)]. Often the support vectors are
a small fraction of the data sample and, as already
mentioned, the solution is said to be sparse. This
property is due to the use of the hinge loss function.

Note that problem (3.4) and equation (3.6) depend
only on kernel evaluations of the form K(x, y).
Therefore, the explicit mapping $\Phi$ is not needed to
solve the SVM problem (3.4) or to evaluate the decision
hyperplane (3.6). In particular, even when the
kernel corresponds to an infinite-dimensional space
(for instance, the Gaussian kernel), there is no problem
with the evaluation of $w^* = \sum_{i=1}^n \lambda_i^* y_i \Phi(x_i)$,
which is not explicitly needed. In practice, $D^*(x)$
is evaluated using the right-hand side of equation
(3.6).
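
Equations (3.5) and (3.6) translate directly into code once the multipliers are known. The sketch below is illustrative: the three data points and the multipliers (0.5, 0.5, 0) are a hand-derived solution for this particular toy problem with a linear kernel, and the function names are hypothetical.

```python
def linear_kernel(x, z):
    """K(x, z) = x^T z."""
    return sum(a * b for a, b in zip(x, z))

def bias(lam, y, X, x_plus, x_minus, kernel):
    """b* from equation (3.5), using one support vector from each class."""
    s_plus = sum(l * yi * kernel(xi, x_plus) for l, yi, xi in zip(lam, y, X))
    s_minus = sum(l * yi * kernel(xi, x_minus) for l, yi, xi in zip(lam, y, X))
    return -0.5 * (s_plus + s_minus)

def decision(x, lam, y, X, b, kernel):
    """D*(x) from equation (3.6); points with lam_i = 0 are skipped."""
    return sum(l * yi * kernel(xi, x)
               for l, yi, xi in zip(lam, y, X) if l > 0) + b

# Hypothetical sample: the third point lies beyond the margin, so its multiplier is 0.
X = [(1.0, 0.0), (-1.0, 0.0), (3.0, 1.0)]
y = [+1, -1, +1]
lam = [0.5, 0.5, 0.0]

b_star = bias(lam, y, X, X[0], X[1], linear_kernel)
d_new = decision((2.0, 3.0), lam, y, X, b_star, linear_kernel)
d_far = decision(X[2], lam, y, X, b_star, linear_kernel)
```

Only the two support vectors enter the sum, which is the sparsity property discussed above: the third training point is classified correctly with $D^*(x) = 3$ without contributing to the decision function.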



3.1 SVMs and the Optimal Bayes Rule 

The results in the previous section are coherent 
with the ones obtained by Lin [40], which state that
the support vector machine classifier approaches the 
optimal Bayes rule and its generalization error con- 
verges to the optimal Bayes risk. 

Consider a two-group classification problem with 
classes +1 and —1 and, to simplify, assume equal 
costs of misclassification. Under this assumption, 
the expected misclassification rate and the expected 
cost coincide. Let $p_1(x) = P(Y = +1 \mid X = x)$, where
X and Y are two random variables whose joint distribution
is p(x, y). The optimal Bayes rule for the
minimization of the expected misclassification rate
is

(3.7)  $$BR(x) = \begin{cases} +1, & \text{if } p_1(x) > 1/2, \\ -1, & \text{if } p_1(x) < 1/2. \end{cases}$$


On one hand, from the previous section we know
that the minimization of problem (3.1) guarantees
(via regularization theory) that the empirical risk
$\frac{1}{n}\sum_{i=1}^n (1 - y_i f(x_i))_+$ converges to the expected error
$E[(1 - Y f(X))_+]$. On the other hand, in [40] it is
shown that the solution to the problem $\min_f E[(1 - Y f(X))_+]$
is $f^*(x) = \mathrm{sign}(p_1(x) - 1/2)$, an equivalent
formulation of (3.7). Therefore, the minimizer
sought by SVMs is exactly the Bayes rule.
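
The result from [40] can be illustrated pointwise: for a fixed x with $p_1(x) = p$, the conditional expected hinge loss is a function of the single number f(x). The sketch below minimizes it over a grid; the grid range and resolution and the function names are assumptions made for the illustration, and a grid search is not the argument used in [40].

```python
def expected_hinge(f, p):
    """E[(1 - Y f)_+ | X = x] when P(Y = +1 | X = x) = p."""
    return p * max(1.0 - f, 0.0) + (1.0 - p) * max(1.0 + f, 0.0)

def pointwise_minimizer(p):
    """Grid minimizer of the conditional expected hinge loss over f in [-2, 2]."""
    grid = [k / 100.0 for k in range(-200, 201)]
    return min(grid, key=lambda f: expected_hinge(f, p))

f_when_p_high = pointwise_minimizer(0.7)   # p_1(x) > 1/2
f_when_p_low = pointwise_minimizer(0.2)    # p_1(x) < 1/2
```

On the interval [-1, 1] the conditional risk equals $1 + f(1 - 2p)$, which is linear in f, so the minimum sits at f = +1 when p > 1/2 and at f = -1 when p < 1/2, in agreement with (3.7).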

In [41] it is pointed out that if the smoothing
parameter $\mu$ in (3.1) is chosen appropriately and
the approximation capacity of the RKHS is large
enough, then the solution to the SVM problem (3.2)
approaches the Bayes rule as $n \to \infty$. For instance,
in the two examples shown in the next subsection,
where the linear kernel $K(x, y) = x^T y$ is used, the
associated RKHS (made up of linear functions) is
rich enough to solve the classification problems. A
richer RKHS should be used for more complex decision
surfaces (see [41]), for instance, the one induced
by the Gaussian kernel or those induced by high degree
polynomial kernels. Regarding the choice of $\mu$,
methods to determine it in an appropriate manner
have been proposed by Wahba [79, 80, 82].

3.2 Illustrating the Performance with 
Simple Examples 

In this first example we consider a two-class separable
classification problem, where each class is made
up of 1000 data points generated from a bivariate
normal distribution $N(\mu_i, I)$, with $\mu_1 = (0, 0)$ and
$\mu_2 = (10, 10)$. Our aim is to illustrate the performance
of the SVM in a simple example and, in
particular, the behavior of the algorithm for different
values of the regularization parameter C in
problem (3.3). The identity mapping $\Phi(x) = x$ is
used. Figure 4(a) illustrates the result for C = 1 (for
C > 1, the same result is obtained). There are exactly
three support vectors and the optimal margin
separating hyperplane obtained by the SVM is
1.05x + 1.00y - 10.4 = 0. For C = 0.01, seven support
vectors are obtained [see Figure 4(b)], and the
discriminant line is 1.02x + 1.00y - 10.4 = 0. For C =
0.00001, 1776 support vectors are obtained [88.8%
of the sample; see Figure 4(c)] and the separating
hyperplane is 1.00x + 1.00y - 13.0 = 0. The three
hyperplanes are very similar to the (normal theory)
linear discriminant function 1.00x + 1.00y - 10.0 = 0.
Notice that the smaller C is, the larger the number
of support vectors. This is due to the fact that, in
problem (3.3), C penalizes the value of the $\xi_i$ variables,
which determine the width of the band that
contains the support vectors.

This second example is quite similar to the pre- 
vious one, but the samples that correspond to each 
class are not separable. In this case the mean vectors 
of the two normal clouds (500 data points in each 
group) are $\mu_1 = (0, 0)$ and $\mu_2 = (4, 0)$, respectively.
The theoretical Bayes error is 2.27%. The normal
theory (and optimal) separating hyperplane is x = 2,
that is, 0.5x + 0y - 1 = 0. The SVM estimated
hyperplane (taking C = 2) is 0.497x - 0.001y - 1 = 0.
The error on a test data set with 20,000 data points 
is 2.3%. Figure 4(d) shows the estimated hyperplane 
and the support vectors (the black points), which 
represent 6.3% of the sample. To show the behav- 
ior of the method when the parameter C varies, 
Figure 4(e) shows the separating hyperplanes for 
30 SVMs that vary C from 0.01 up to 10. All of 
them look very similar. Finally, Figure 4(f) shows 
the same 30 hyperplanes when two outlying points 
(enhanced in black) are added to the left cloud. 
Since the estimated SVM discriminant functions de- 
pend only on the support vectors, the hyperplanes 
remain unchanged. 
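
The normal theory reference values quoted in both examples follow from closed-form expressions for two Gaussian classes with identity covariance and equal priors. The sketch below recomputes them; the function names are illustrative, and the formulas are the standard ones for this setting rather than code from the paper.

```python
import math

def normal_theory_hyperplane(mu1, mu2):
    """Return (w, b) with w^T x + b = 0 the optimal boundary for N(mu1, I) vs N(mu2, I)."""
    w = [m2 - m1 for m1, m2 in zip(mu1, mu2)]
    b = -0.5 * (sum(v * v for v in mu2) - sum(v * v for v in mu1))
    return w, b

def bayes_error(mu1, mu2):
    """Bayes error Phi(-d/2), where d is the distance between the two means."""
    d = math.sqrt(sum((m2 - m1) ** 2 for m1, m2 in zip(mu1, mu2)))
    return 0.5 * math.erfc(d / (2.0 * math.sqrt(2.0)))

# First example: means (0, 0) and (10, 10).
w1, b1 = normal_theory_hyperplane((0.0, 0.0), (10.0, 10.0))

# Second example: means (0, 0) and (4, 0).
w2, b2 = normal_theory_hyperplane((0.0, 0.0), (4.0, 0.0))
err2 = bayes_error((0.0, 0.0), (4.0, 0.0))
```

Dividing the first hyperplane by 10 gives 1.00x + 1.00y - 10.0 = 0, dividing the second by 8 gives 0.5x + 0y - 1 = 0, and `err2` is approximately 0.0227, matching the values stated in the text.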

3.3 The Waveform Data Set 

We next illustrate the performance of SVMs on 
a well-known three-class classification example con- 
sidered to be a difficult pattern recognition prob- 
lem [28], the waveform data set introduced in [13]. 
For the sake of clarity, we reproduce the data de- 
scription. Each class is generated from a random 



convex combination of two of three triangular waveforms,
namely, $h_1(i) = \max(6 - |i - 11|, 0)$, $h_2(i) = h_1(i - 4)$
and $h_3(i) = h_1(i + 4)$, sampled at the integers
$i \in \{1, \dots, 21\}$, plus a standard Gaussian noise
term. Thus, each data point is represented by
$x = (x_1, \dots, x_{21})$, where each component is defined by

$x_i = u h_1(i) + (1 - u) h_2(i) + \varepsilon_i$, for Class 1,

$x_i = u h_1(i) + (1 - u) h_3(i) + \varepsilon_i$, for Class 2,

$x_i = u h_2(i) + (1 - u) h_3(i) + \varepsilon_i$, for Class 3,

with $u \sim U(0, 1)$ and $\varepsilon_i \sim N(0, 1)$. A nice picture of
sampled waveforms can be found on page 404 of [28].
The waveform data base [available from the UCI 
repository (data sets available from the University of 
California, Irvine, at http://kdd.ics.uci.edu/)] 
contains 5000 instances generated using equal prior 
probabilities. In this experiment we have used 400 
data values for training and 4600 for test. Breiman, 
Friedman, Olshen and Stone [13] reported a Bayes 
error rate of 14% for this data set. Since we are 
handling three groups, we use the "one-against-one"
approach, in which $\binom{3}{2} = 3$ binary SVM classifiers are
trained and the predicted class is found by a voting
scheme: each classifier assigns a class to each datum,
and the data point is assigned to the class with the
most votes [37]. A first run over ten simulations of the
experiment using C = 1 in problem (3.3) and the Gaussian
kernel $K(x, y) = e^{-\|x - y\|^2/200}$ gave an error rate of
14.6%. To confirm the validity of the result, we have 
run 1000 replications of the experiment. The average 
error rate over the 1000 simulations on the training 
data was 10.87% and the average error rate on the 
test data was 14.67%. The standard errors of the averages
were 0.004 and 0.005, respectively. To our knowledge,
this result improves on any other described in the
literature. For instance, the best results described
in [28] are provided by FLDA and Fisher FDA (flexi- 
ble discriminant analysis) with MARS (multivariate 
adaptive regression splines) as the regression proce- 
dure (degree = 1), both achieving a test error rate of 
19.1%. Figure 5 shows a principal component analy- 
sis (PCA) projection of the waveform data into two 
dimensions with the misclassified test data points 
(marked in black) for one of the SVM simulations. 
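
The Gaussian kernel and the one-against-one voting rule used in this
experiment can be sketched as follows. The binary classifiers are
represented by arbitrary Python callables standing in for trained
SVMs; the function names and the tie-breaking rule (first class in
the list wins) are assumptions made for the illustration.

```python
import math
from collections import Counter
from itertools import combinations

def gaussian_kernel(x, y, scale=200.0):
    """K(x, y) = exp(-||x - y||^2 / scale)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / scale)

def one_against_one_predict(x, classes, pairwise):
    """Predict by majority vote over all pairwise classifiers.

    pairwise maps each pair (a, b) of classes to a callable that
    returns either a or b for the input x.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](x)] += 1
    # Ties are broken in favor of the class listed first.
    return max(classes, key=lambda c: votes[c])
```

With three classes, combinations(classes, 2) yields the three pairs
(1, 2), (1, 3) and (2, 3), one per binary classifier.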

4. FURTHER EXAMPLES 

In this section we will review some well-known
applications of SVMs to real-world problems. In
particular, we will focus on text categorization,
bioinformatics and image recognition.

Fig. 5. A PCA projection of the waveform data. The black
points represent the misclassified data points using an SVM
with the Gaussian kernel.

Text categorization consists of the classification of
documents into a predefined number of given categories.
As an example, consider the document collection
made up of Usenet News messages. They
are organized in predefined classes such as computation,
religion, statistics and so forth. Given a new
document, the task is to conduct the category
assignment in an automatic way. Text categorization
is used by many Internet search engines to select
Web pages related to user queries. Documents are
represented in a vector space of dimension equal
to the number of different words in the vocabulary.
Therefore, text categorization problems involve
high-dimensional inputs and the data set consists of
a sparse document-by-term matrix. A detailed
treatment of SVMs for text categorization can be found
in [34]. The performance of SVMs in this task will
be illustrated on the Reuters data base. This is a
text collection composed of 21,578 documents and
118 categories. The data space in this example has
dimension 9947, the number of different words that
describe the documents. The results obtained using
a SVM with a linear kernel are consistently better
across the categories than those obtained with four
widely used classification methods: naive Bayes [24],
Bayesian networks [29], classification trees [13] and
k-nearest neighbors [17]. The average rate of
success for SVMs is 87% while for the mentioned
methods the rates are 72%, 80%, 79% and 82%,
respectively (see [34] and [25] for further details). However,
the most impressive feature of SVM text classifiers
is their training time: SVMs are four times faster
than the naive Bayes classifier (the fastest of the
other methods) and 35 times faster than classification
trees. This performance is due to the fact that
SVM algorithms take advantage of sparsity in the
document-by-term matrix. Note that methods that
involve the diagonalization of large and dense
matrices (like the criterion matrix in FLDA) are out of
consideration for text classification because of their
expensive computational requirements.
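
To illustrate why sparsity helps, the following Python sketch
evaluates a linear kernel between two documents stored as sparse
term-count dictionaries. It is an illustration only; the
representation and the function name are our assumptions, not the
implementation used in [34].

```python
def sparse_dot(doc_a, doc_b):
    """Linear kernel between two documents given as {term: count} maps.

    Only the terms present in the shorter document are visited, so the
    cost does not depend on the vocabulary size.
    """
    if len(doc_a) > len(doc_b):
        doc_a, doc_b = doc_b, doc_a
    return sum(count * doc_b[term]
               for term, count in doc_a.items() if term in doc_b)
```

For two documents with a few dozen distinct words each, the kernel
is evaluated in a few dozen operations, even though the data space
has dimension 9947.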

We next outline some SVM applications in bioin- 
formatics. There is an increasing interest in ana- 
lyzing microarray data, that is, analyzing biologi- 
cal samples using their genetic expression profiles. 
The SVMs have been applied recently to tissue clas- 
sification [26], gene function prediction [59], pro- 
tein subcellular location prediction [31], protein sec- 
ondary structure prediction [32] and protein fold 
prediction [23], among other tasks. In almost all
cases, SVMs outperformed other classification methods;
in the worst case, SVM performance was
at least similar to that of the best non-SVM method. For
instance, in protein subcellular location prediction 
[31], we have to predict protein subcellular positions 
from prokaryotic sequences. There are three pos- 
sible location categories: cytoplasmic, periplasmic 
and extracellular. From a pure classification point of 
view, the problem reduces to classifying 20-dimen- 
sional vectors into three (highly unbalanced) classes. 
Prediction accuracy for SVMs (with a Gaussian ker- 
nel) amounts to 91.4%, while neural networks and a 
first-order Markov chain [75] have accuracy of 81% 
and 89.1%, respectively. The results obtained are 
similar for the other problems. It is important to 
note that there is still room for improvement. 

Regarding image processing, we will overview two
well-known problems: handwritten digit identification
and face recognition. With respect to the first
problem, the U.S. Postal Service data base contains
9298 samples of digits obtained from real-life zip
codes (divided into 7291 training samples and 2007
samples for testing). Each digit is represented by a
$16 \times 16$ gray level matrix; therefore each data point
is represented by a vector in $\mathbb{R}^{256}$. The human
classification error for this problem is known to be 2.5%
[22]. The error rate for a standard SVM with a third
degree polynomial kernel is 4% (see [22] and
references therein), while the best known alternative
method, the specialized neural network LeNet1 [39],
achieves an error rate of 5%. For this problem, using
a specialized SVM with a third degree polynomial
kernel [22] lowers the error rate to 3.2%, close to
the human performance. The key to this specialization
lies in the construction of the decision function
in three phases: in the first phase, a SVM is trained
and the support vectors are obtained; in the second
phase, new data points are generated by transforming
these support vectors under some groups of
transformations (rotations and translations); in the
third phase, the final decision hyperplane is built by
training a SVM with the new points.
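
The second phase can be illustrated with a short Python sketch that
generates translated copies of support vectors stored as gray level
matrices. The one-pixel shifts, the zero padding and the function
names are assumptions made for the example; the training steps of
the first and third phases are not shown.

```python
def translate(image, dx, dy, fill=0):
    """Shift a 2-D image (list of rows) by dx columns and dy rows."""
    rows, cols = len(image), len(image[0])
    shifted = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + dy, c + dx
            if 0 <= nr < rows and 0 <= nc < cols:
                shifted[nr][nc] = image[r][c]
    return shifted

def virtual_examples(support_vectors,
                     shifts=((1, 0), (-1, 0), (0, 1), (0, -1))):
    """Generate translated copies of (image, label) support vectors."""
    return [(translate(image, dx, dy), label)
            for image, label in support_vectors
            for dx, dy in shifts]
```

Each support vector yields four new labelled points here; the SVM of
the third phase would then be trained on these generated points.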

Concerning face recognition, gender detection has 
been analyzed by Moghaddam and Yang [45]. The 
data contain 1755 face images (1044 males and 711 fe- 
males), and the overall error rate for a SVM with a 
Gaussian kernel is 3.2% (2.1% for males and 4.8% 
for females). The results for a radial basis neural 
network [63], a quadratic classifier and FLDA are, 
respectively, 7.6%, 10.4% and 12.9%. 

Another outstanding application of SVMs is the
detection of human faces in gray-level images [56].
The problem is to determine whether an image contains
human faces and, if so, to return an
encoding of their position. The detection rate for
a SVM using a second degree polynomial kernel is
97.1%, while for the best competing system the rate
is 94.6%. A number of impressive photographs that
show the effectiveness of this application for face
location can be consulted in [57].

5. EXTENSIONS OF SVMS: SUPPORT 
VECTOR REGRESSION 

It is natural to contemplate how to extend the ker- 
nel mapping explained in Section 2 to well-known 
techniques for data analysis such as principal com- 
ponent analysis, Fisher linear discriminant analysis 
and cluster analysis. In this section we will describe 
support vector regression, one of the most popular 
extensions of support vector methods, and give some 
references regarding other extensions. 

The ideas underlying support vector regression
are similar to those within the classification scheme.
From an intuitive viewpoint, the data are mapped
into a feature space and then a hyperplane is fitted
to the mapped data. From a mathematical perspective,
the support vector regression function is also
derived within the RKHS context. In this case, the
loss function involved is known as the $\varepsilon$-insensitive
loss function (see [76]), which is defined as
$L(y_i, f(x_i)) = (|f(x_i) - y_i| - \varepsilon)_+$, $\varepsilon \ge 0$. This loss
function ignores errors of size less than $\varepsilon$ (see
Figure 6). A discussion of the relationship of the
$\varepsilon$-insensitive loss function and the ones used in
robust statistics can be found in [28]. Using this loss
function, the following optimization problem,
similar to (3.1) (also consisting of the minimization of a
Tikhonov regularization functional), arises:

\[
\min_{f \in H_K} \frac{1}{n} \sum_{i=1}^{n} (|f(x_i) - y_i| - \varepsilon)_+ + \mu \|f\|_K^2, \tag{5.1}
\]

where $\mu > 0$, $H_K$ is the RKHS associated with the
kernel $K$, $\|f\|_K$ denotes the norm of $f$ in the RKHS
and $(x_i, y_i)$ are the sample data points.
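
As an illustration, the loss term of this functional can be computed
with the following Python sketch. The function names are ours, and
the regularization term is left out because it depends on the chosen
kernel.

```python
def eps_insensitive_loss(y, fx, eps):
    """(|f(x) - y| - eps)_+ : zero inside the band, linear outside."""
    return max(abs(fx - y) - eps, 0.0)

def mean_eps_insensitive_loss(ys, predictions, eps):
    """Average loss over the sample: the data-fit term of the functional."""
    total = sum(eps_insensitive_loss(y, fx, eps)
                for y, fx in zip(ys, predictions))
    return total / len(ys)
```

A prediction within eps of the observed value contributes nothing,
so only the points outside the band influence the fit.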

Once more, by the representer theorem, the solution
to problem (5.1) has the form
$f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x) + b$, where $x_i$ are the sample data points.
It is immediate to show that $\|f\|_K^2 = \|w\|^2$, where
$w = \sum_{i=1}^{n} \alpha_i \Phi(x_i)$ and $\Phi$ is the mapping that defines
the kernel function. Thus, problem (5.1) can be
restated as

\[
\min_{w, b} \frac{1}{n} \sum_{i=1}^{n} (|w^T \Phi(x_i) + b - y_i| - \varepsilon)_+ + \mu \|w\|^2. \tag{5.2}
\]

Fig. 6. The $\varepsilon$-insensitive loss function
$L(y_i, f(x_i)) = (|f(x_i) - y_i| - \varepsilon)_+$, $\varepsilon \ge 0$.

Since the $\varepsilon$-insensitive loss function is
nondifferentiable, this problem has to be formulated so that it
can be solved by appropriate optimization methods.
Straightforwardly, the equivalent (convex) problem
to solve is

\[
\begin{aligned}
\min_{w, b, \xi, \xi^*} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \\
\text{s.t.} \quad & (w^T \Phi(x_i) + b) - y_i \le \varepsilon + \xi_i, && i = 1, \ldots, n, \\
& y_i - (w^T \Phi(x_i) + b) \le \varepsilon + \xi_i^*, && i = 1, \ldots, n, \\
& \xi_i, \xi_i^* \ge 0, && i = 1, \ldots, n,
\end{aligned}
\tag{5.3}
\]

where $C = 1/(2\mu n)$. Notice that $\varepsilon$ appears only in
the constraints, forcing the solution to be calculated
by taking into account a confidence band around the
regression equation. The $\xi_i$ and $\xi_i^*$ are slack variables
that allow for some data points to stay outside the
confidence band determined by $\varepsilon$. This is the
standard support vector regression formulation. Again,
the dual of problem (5.3) is a convex quadratic
optimization problem, and the regression function takes
the same form as equation (2.1). For a detailed
exposition of support vector regression, refer to [71]
or [69].
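
The value of the constant $C$ can be checked with a short
derivation. We write $\xi_i$ and $\xi_i^*$ for the two slack variables
of problem (5.3); this notation is our choice. For fixed $w$ and $b$,
the smallest feasible slacks satisfy the first identity below, and
the second line rescales the objective of (5.2) by the positive
factor $1/(2\mu)$, which does not change the minimizer.

```latex
\[
\xi_i + \xi_i^* = (|w^T \Phi(x_i) + b - y_i| - \varepsilon)_+ ,
\]
\[
\frac{1}{2\mu}\left[\frac{1}{n}\sum_{i=1}^{n}
  (|w^T \Phi(x_i) + b - y_i| - \varepsilon)_+ + \mu \|w\|^2\right]
= \frac{1}{2}\|w\|^2 + \frac{1}{2\mu n}\sum_{i=1}^{n} (\xi_i + \xi_i^*) .
\]
```

Comparing the right-hand side with the objective of (5.3) gives
$C = 1/(2\mu n)$.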

One of the most popular applications of support 
vector regression concerns load forecasting, an im- 
portant issue in the power industry. In 2001 a pro- 
posal based on SVMs for regression was the winner 
of the European Network of Excellence on Intelligent 
Technologies competition. The task was to supply 
the prediction of maximum daily values of electrical 
loads for January 1999 (31 data values altogether). 
To this end each challenger was given half-hourly
loads, average daily temperatures and the holidays
for the period 1997-1998. The mean absolute
percentage error for daily data using the SVM regression
model was about 2%, significantly improving on
the results of most competition proposals. It is
important to point out that the SVM procedure used
in the contest was standard, in the sense that no
special modifications were made for the particular
problem at hand. See [14] for further details.
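
For reference, the mean absolute percentage error can be computed as
in the following Python sketch. The function name is ours, and any
numbers passed to it here would be made-up illustrations, not the
competition data.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes every actual value is nonzero.
    """
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)
```

For instance, forecasts of 98 and 204 for true loads of 100 and 200
are each off by 2% of the true value.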

Many other kernel methods have been proposed in 
the literature. To name a few, there are extensions 
to PCA [70], Fisher discriminant analysis [6, 44], 
cluster analysis [8, 46], partial least squares [66], 
time series analysis [50], multivariate density esti- 
mation [49, 54, 68], classification with asymmetric 
proximities [52], combination with neural network 
models [53] and Bayesian kernel methods [74]. 

6. OPEN ISSUES AND FINAL REMARKS 

The underlying model implemented in SVMs is
determined by the choice of the kernel. Deciding
which kernel is the most suitable for a given
application is obviously an important (and open) issue.
A possible approach is to impose some restrictions
directly on the structure of the classification (or
regression) function $f$ implemented by the SVM. A
way to proceed is to consider a linear differential
operator $D$, and choose $K$ as the Green's function for
the operator $D^*D$, where $D^*$ is the adjoint operator
of $D$ [4]. It is easy to show that the penalty term
$\|f\|_K^2$ equals $\|Df\|_{L_2}^2$. Thus, the choice of the
differential operator $D$ imposes smoothing conditions
on the solution $f$. This is also the approach used in
functional data analysis [65]. For instance, if $D^*D$
is the Laplacian operator, the kernels obtained are
harmonic functions. The simplest case corresponds
to (see, e.g., [35]) $K(x, y) = x^T y + c$, where $c$ is a
constant. Another interesting example is the Gaussian
kernel. This kernel arises from a differential
operator which penalizes an infinite sum of derivatives.
The details for its derivation can be found in [63].

A different approach is to build a specific kernel 
directly for the data at hand. For instance, Wu and 
Amari [83] proposed the use of differential geome- 
try methods [2] to derive kernels that improve class 
separation in classification problems. 

An alternative research line arises when a bat- 
tery of different kernels is available. For instance, 
when dealing with handwriting recognition, there 
are a number of different (nonequivalent) metrics 
that provide complementary information. The task 
here is to derive a single kernel which combines the 
most relevant features of each metric to improve the 
classification performance (see, e.g., [38] or [42]). 

Regarding more theoretical questions, Cucker and 
Smale [21], as already mentioned, provided sufficient 
conditions for the statistical consistency of SVMs 
from a functional analysis point of view (refer to 
the Appendix for the details). On the other hand,
the statistical learning theory developed by Vapnik
and Chervonenkis (summarized in [77]) provides
necessary and sufficient conditions in terms of the
Vapnik-Chervonenkis (VC) dimension (a capacity
measure for functions). However, the estimation of
the VC dimension for SVMs is often not possible and
the relationship between the two approaches is still an
open issue.

From a statistical point of view an important
subject remains open: the interpretability of the SVM
outputs. Some practical proposals for transforming the
SVM classification outputs into a posteriori class
probabilities can be found in [62, 76] and [72].

Regarding the finite sample performance of SVMs,
a good starting point can be found in [55], where
bias and variability computations for linear
inversion algorithms (a particular case of regularization
methods) are studied. The way to extend these ideas
to the SVM nonlinear case is an interesting open
problem.

Concerning software for SVMs, a variety of
implementations are freely available from the Web, most
of them reachable at http://www.kernel-machines.org/.
In particular, Matlab toolboxes and R/Splus libraries
can be downloaded from this site. Additional
information on implementation details concerning SVMs
can be found in [20] and [69].

As a final suggestion, a novice reader may find
it interesting to review a number of other
regularization methods, such as penalized likelihood
methods [27], classification and regression with Gaussian
processes [72, 82], smoothing splines [81], functional
data analysis [65] and kriging [19].

APPENDIX: STATISTICAL CONSISTENCY OF 
THE EMPIRICAL RISK 

When it is not possible to assume a parametric 
model for the data, ill-posed problems arise. The 
number of data points which can be recorded is 
finite, while the unknown variables are functions 
which require an infinite number of observations for 
their exact description. Therefore, finding a solution 
implies a choice from an infinite collection of alter- 
native models. A problem is well-posed in the sense 
of Hadamard if (1) a solution exists; (2) the solution 
is unique; (3) the solution depends continuously on 
the observed data. A problem is ill-posed if it is not 
well-posed. 

Inverse problems constitute a broad class of
ill-posed problems [73]. Classification, regression and
density estimation can be regarded as inverse
problems. In the general setting, we consider a mapping
$A : H_1 \to H_2$, where $H_1$ represents a metric function
space and $H_2$ represents a metric space in which the
observed data (which could be functions) live. For
instance, in a linear regression problem, $H_1$
corresponds to the finite-dimensional vector space $\mathbb{R}^{k+1}$,
where $k$ is the number of regressors; $H_2$ is $\mathbb{R}^n$, where
$n$ is the number of data points; and $A$ is the linear
operator induced by the data matrix of dimension
$n \times (k + 1)$. Let $y = (y_1, \ldots, y_n)$ be the vector of
response variables and denote by $f$ the regression
equation we are looking for. Then the regression
problem consists of solving the inverse problem $Af = y$.
A similar argument applies to the classification
setting. In this case, the $y$ values live in a compact
subset of the $H_2$ space [77].

An example of an inverse problem in which $H_2$ is a
function space is the density estimation one. In this
problem $H_1$ and $H_2$ are both function spaces and $A$
is a linear integral operator given by
$(Af)(x) = \int K(x, y) f(y)\,dy$, where $K$ is a predetermined
kernel function and $f$ is the density function we are
seeking. The problem to solve is $Af = F$, where $F$
is the distribution function. If $F$ is unknown, the
empirical distribution function $F_n$ is used instead,
and the inverse problem to solve is $Af = y$, with
$y = F_n$.
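
The empirical distribution function used as the right-hand side can
be computed directly from a univariate sample. The following Python
sketch shows one way to do so; the function name is ours.

```python
from bisect import bisect_right

def empirical_cdf(sample):
    """Return F_n, where F_n(t) is the fraction of sample points <= t."""
    data = sorted(sample)
    n = len(data)

    def F_n(t):
        # bisect_right counts how many sorted values are <= t.
        return bisect_right(data, t) / n

    return F_n
```

The returned function is a step function that jumps by 1/n at each
observation, with larger jumps at repeated values.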

We will focus on classification and regression tasks.
Therefore, we assume there exist a function $f : X \to Y$
and a probability measure $p$ defined in $X \times Y$ so
that $E[y \mid x] = f(x)$. For an observed sample
$\{(x_i, y_i) \in X \times Y\}_{i=1}^{n}$, the goal is to obtain the "best" possible
solution to $Af = y$, where, as mentioned above, $y$ is
the $n$-dimensional vector of $y_i$'s and $A$ is an operator
that depends on the $x_i$ values. To evaluate the
quality of a particular solution, a "loss function"
$L(f; x, y)$ has to be introduced, which we will
denote $L(y, f(x))$ in what follows. A common example
of a loss function for regression is the quadratic loss
$L(y, f(x)) = (y - f(x))^2$.

Consider the Banach space $C(X)$ of continuous
functions on $X$ with the norm $\|f\|_\infty = \sup_{x \in X} |f(x)|$.
The solution to the inverse problem in each case is
the minimizer $f^*$ of the risk functional $R(f) : C(X) \to \mathbb{R}$
defined by (see [21])

\[
R(f) = \int_{X \times Y} L(y, f(x))\, p(x, y)\,dx\,dy. \tag{A.1}
\]

Of course, the solution depends on the function space
in which $f$ lives. Following [21], the hypothesis space,
denoted by $\mathcal{H}$ in the sequel, is chosen to be a
compact subset of $C(X)$. In particular, only bounded
functions $f : X \to Y$ are considered.

Under these conditions, and assuming a continuous
loss function $L$, Cucker and Smale [21] proved that
the functional $R(f)$ is continuous. The existence of
$f^* = \arg\min_{f \in \mathcal{H}} R(f)$ follows from the compactness
of $\mathcal{H}$ and the continuity of $R(f)$. In addition, if $\mathcal{H}$ is
convex, $f^*$ will be unique and the problem becomes
well-posed.

In practice, it is not possible to calculate $R(f)$
and the empirical risk $R_n(f) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))$
must be used. This is not a serious complication
since asymptotic uniform convergence of $R_n(f)$ to
the risk functional $R(f)$ is a proven fact (see [21]).
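
The empirical risk is simply a sample average of losses, as the
following Python sketch shows for the quadratic loss. The function
names and the representation of the sample as a list of (x, y) pairs
are assumptions made for the example.

```python
def quadratic_loss(y, fx):
    """L(y, f(x)) = (y - f(x))^2."""
    return (y - fx) ** 2

def empirical_risk(f, sample, loss=quadratic_loss):
    """R_n(f): average loss of f over a list of (x, y) pairs."""
    return sum(loss(y, f(x)) for x, y in sample) / len(sample)
```

Any other loss with the same two-argument signature can be passed
through the loss parameter.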



In summary, imposing compactness on the hy- 
pothesis space assures well-posedness of the problem 
to be solved and uniform convergence of the empir- 
ical error to the risk functional for a broad class 
of loss functions, including the square loss and loss 
functions used in the SVM setting. 

The question of how to impose compactness on
the hypothesis space is addressed by regularization
theory. A possibility (followed by SVMs) is to minimize
Tikhonov's regularization functional

\[
\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \Omega(f), \tag{A.2}
\]

where $\lambda > 0$, $H$ is an appropriate function space, and
$\Omega(f)$ is a convex positive functional. By standard
optimization theory arguments, it can be shown that,
for fixed $\lambda$, the inequality $\Omega(f) \le C$ holds for a
constant $C > 0$. Therefore, the space where the solution
is searched takes the form $\mathcal{H} = \{f \in H : \Omega(f) \le C\}$,
that is, a convex compact subset of $H$.
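
A one-dimensional example may help to see how the regularization
parameter bounds the penalty. The Python sketch below assumes a
linear function f(x) = wx, the quadratic loss and the penalty w^2;
these choices are ours for illustration and differ from the loss
used by SVMs. Setting the derivative of the functional to zero gives
the closed form used in the code.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer over w of (1/n) * sum (y_i - w*x_i)^2 + lam * w^2.

    The first-order condition gives
    w = sum(x_i*y_i) / (sum(x_i^2) + n*lam).
    """
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)
```

Increasing lam shrinks the solution toward zero, so the penalty of
the minimizer becomes smaller, in line with the bound on the penalty
stated above.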